Grade Clustering and Seriation of Words Based on Their Co-Occurrences
Abstract
We present the use of grade correspondence analysis (GCA) in text mining. A sample of words extracted from 20 newsgroups has been linearly arranged according to the concordance between their co-occurrence distributions. The word co-occurrence matrix, obtained using the HAL (Hyperspace Analogue to Language) system and normalized to de-emphasize overly frequent terms, has been reordered by the GCA algorithm as implemented in the GradeStat program. The aim of this reordering was to approach TP2 regularity of dependence, measured using Kendall’s tau. Deviations from regularity have been used to order words into series, cluster them, and visualize them using overrepresentation maps. Seriation provided a contextual scale running from computer-related terms to politics- and religion-related terms. Words appearing in various contexts occupy average positions on this scale. Word ordering based on similarity between occurrence patterns can be used in thesaurus building and query expansion in information retrieval applications.
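The two ingredients named in the abstract, a windowed co-occurrence count and a concordance measure that a reordering tries to increase, can be illustrated with a minimal sketch. It is not the HAL system or the GradeStat implementation. The function names, the window size, the linear distance weighting, and the table-based Kendall-style coefficient are all assumptions made for illustration.

```python
from collections import defaultdict


def hal_cooccurrence(tokens, window=5):
    """HAL-style counts (simplified, assumed weighting): for each word, add
    weight window - d + 1 for the word found d positions before it."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i - d < 0:
                break
            counts[(word, tokens[i - d])] += window - d + 1
    return counts


def concordance(table):
    """Kendall-style concordance of a two-way table of non-negative counts:
    2 * (concordant mass - discordant mass) / total**2.
    Depends on the current order of rows and columns."""
    rows, cols = len(table), len(table[0])
    total = sum(sum(row) for row in table)
    score = 0.0
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):
                for l in range(cols):
                    if l > j:      # cell (k, l) lies below and to the right
                        score += table[i][j] * table[k][l]
                    elif l < j:    # cell (k, l) lies below and to the left
                        score -= table[i][j] * table[k][l]
    return 2 * score / total ** 2
```

Under these assumptions, `concordance([[1, 0], [0, 1]])` gives 0.5 and swapping the two rows gives -0.5, which shows why permuting rows and columns changes the measured regularity. A reordering procedure in the spirit of GCA would search over such permutations for an arrangement with higher concordance.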
Similar resources
A Probabilistic Topic Model Based on Local Word Relationships in Overlapping Windows
A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process, extracting the topics from the given documents. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...
A Seriation Approach for Visualization-Driven Discovery of Co-Expression Patterns in Serial Analysis of Gene Expression (SAGE) Data
BACKGROUND Serial Analysis of Gene Expression (SAGE) is a DNA sequencing-based method for large-scale gene expression profiling that provides an alternative to microarray analysis. Most analyses of SAGE data aimed at identifying co-expressed genes have been accomplished using various versions of clustering approaches that often result in a number of false positives. PRINCIPAL FINDINGS Here we...
A New Document Embedding Method for News Classification
Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A great deal of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
Combining Syntactic Co-occurrences and Nearest Neighbours in Distributional Methods to Remedy Data Sparseness
The task of automatically acquiring semantically related words has led people to study distributional similarity. The distributional hypothesis states that words that are similar share similar contexts. In this paper we present a technique that aims at improving the performance of a syntax-based distributional method by augmenting the original input of the system (syntactic co-occurrences) wit...
The Intellectual Structure of Knowledge in the Field of Distance Education Using Co-Word Analysis
Background: Co-word analysis is one of the content analysis methods used in scientometric studies and in mapping the scientific structure of various fields. The purpose of the present research is to map the structure of distance education using co-word analysis. Methods: The research method is content analysis using co-word analysis. The research population comprises 31607 documents indexed in the...
Publication date: 2006